In this unit we will learn to quantify the relationship between two numerical variables, and to model a numerical response variable using a numerical or categorical explanatory variable.
The scatterplot below shows the relationship between the high school (HS) graduation rate in all 50 US states and DC and the percentage of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).
Response Variable? Percentage in poverty
Explanatory Variable? Percentage of HS graduates
Relationship? Linear, negative, moderately strong
Which of the following is the best guess for the correlation between percentage in poverty and percentage of HS graduates?
Which of the following is the best guess for the correlation between percentage in poverty and percentage of female householders?
Which of the following has the strongest correlation, i.e., the correlation coefficient closest to \(-1\) or \(+1\)?
Option (b). While (a) clearly follows a pattern, that pattern is not linear, and the correlation coefficient measures only the strength of a linear relationship.
Which of the following appears to be the line that best fits the linear relationship between percentage in poverty and percentage of HS grad? Choose one.
The best fit appears to be (a) (over (d)). (b) and (c) aren't good fits at all.
Residuals are the leftovers from the model fit.
\[ \text{Data} = \text{Fit} + \text{Residuals} \]
Residuals are the difference between the observed (\(y_i\)) and the predicted (\(\hat{y}_i\)). We label these as \(e_i\).
\[ e_i = y_i - \hat{y}_i \]
Note: the other way is not correct! We always take observed minus predicted, not the other way around.
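The definition above can be checked numerically. The following is a minimal sketch in R, using a small toy data set invented purely for illustration (it is not the poverty data); the variable names `x`, `y`, `fit`, `y_hat`, and `e` are assumptions of this example.

```r
# Toy data, invented for illustration only (not the poverty data set).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least squares fit of y on x.
fit <- lm(y ~ x)

y_hat <- fitted(fit)   # predicted values
e <- y - y_hat         # residuals: observed minus predicted

# The manual residuals agree with R's built-in extractor.
print(all.equal(unname(e), unname(resid(fit))))

# Data = Fit + Residuals holds for every observation.
print(all.equal(unname(y_hat + e), y))
```

For a least squares line with an intercept, the residuals also sum to zero (up to rounding), which `sum(e)` would confirm.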
The labeled points indicate that:
Notation:
The slope of the regression line (remember: \(y = mx + b\)) can be calculated as
\[ b_1 = \frac{s_y}{s_x} R \]
In context:
\[ b_1 = \frac{3.1}{3.73} \cdot (-0.75) = -0.62 \]
Interpretation: for each additional percentage point in the HS graduation rate, we would expect the percentage living in poverty to be lower on average by 0.62 percentage points.
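The slope calculation can be reproduced from the three summary statistics alone. This is a short sketch in R; the numbers are the ones quoted in the calculation above, while the variable names `s_y`, `s_x`, `R`, and `b1` are chosen here for illustration.

```r
# Summary statistics quoted in the slope calculation.
s_y <- 3.1     # standard deviation of the response (% in poverty)
s_x <- 3.73    # standard deviation of the explanatory variable (% HS graduates)
R   <- -0.75   # correlation between the two variables

# Slope of the least squares line: b1 = (s_y / s_x) * R
b1 <- (s_y / s_x) * R
print(round(b1, 2))
```

Because the standard deviations are always positive, the slope takes the sign of the correlation.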
The intercept is where the regression line intersects the y-axis. The calculation of the intercept uses the fact that a regression line always passes through (\(\bar{x}, \bar{y}\)).
\[ b_0 = \bar{y} - b_1 \bar{x} \]
We calculate: \[ b_0 = 11.35 - (-0.62) \cdot 86.01 = 64.68 \]
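The same arithmetic can be sketched in R. The means and the slope are the values used in the calculation above; the names `y_bar`, `x_bar`, `b0`, `b1`, and the helper `predict_poverty` are hypothetical and introduced only for this example.

```r
# Values taken from the calculation above.
y_bar <- 11.35   # mean of % in poverty
x_bar <- 86.01   # mean of % HS graduates
b1    <- -0.62   # slope (rounded, as in the text)

# Intercept: b0 = y_bar - b1 * x_bar
b0 <- y_bar - b1 * x_bar
print(round(b0, 2))

# Hypothetical helper: predicted % in poverty for a given % HS graduates.
predict_poverty <- function(hs_grad) b0 + b1 * hs_grad

# The fitted line passes through the point of means (x_bar, y_bar).
print(all.equal(predict_poverty(x_bar), y_bar))
```

Evaluating `predict_poverty()` is only sensible for graduation rates within the range of the observed data.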
Which of the following is the correct interpretation of the intercept?
Since there are no states in the data set with zero HS graduates, the intercept is of no interest on its own: it is not very useful, and it is also not reliable, because the prediction at 0% HS graduates lies far outside the range of the observed data.
Note: these statements are not causal, unless the study is a randomized controlled experiment.
What condition is this linear model obviously violating?
Which of the below is the correct interpretation of \(R = -0.62, R^2 = 0.38\)?
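As background for this question, the sketch below shows in R how \(R^2\) relates to \(R\) in a simple linear regression. The data are a toy example invented for illustration, and all variable names are assumptions of this sketch.

```r
# The squared correlation quoted in the question.
print(round((-0.62)^2, 2))

# Toy data, invented for illustration only.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(9.8, 8.1, 8.4, 6.0, 5.7, 4.1)
fit <- lm(y ~ x)

# Three equivalent ways to obtain R^2 in simple linear regression.
r2_from_cor <- cor(x, y)^2                 # square of the correlation
r2_from_fit <- summary(fit)$r.squared      # reported by lm()
r2_from_ss  <- 1 - sum(resid(fit)^2) / sum((y - mean(y))^2)  # 1 - SSE / SST

print(c(r2_from_cor, r2_from_fit, r2_from_ss))
```

The third form makes the usual reading explicit: \(R^2\) is the proportion of the variability in the response that is explained by the model.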
Which region (northeast, midwest, west or south) is the reference level?
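In R, the reference level of a categorical predictor is the first level of the factor, and it is the level that gets no indicator column in the model. The sketch below uses a toy `region` factor whose level order is an assumption, chosen to match the order in which the question lists the regions; the actual data set may order its levels differently.

```r
# Toy factor; the level order is an assumption made for this example.
region <- factor(
  c("west", "south", "northeast", "midwest", "south", "west"),
  levels = c("northeast", "midwest", "west", "south")
)

# The first level is the reference level under R's default treatment contrasts.
print(levels(region)[1])

# The reference level has no indicator (dummy) column in the design matrix.
print(colnames(model.matrix(~ region)))

# relevel() chooses a different reference level.
region2 <- relevel(region, ref = "south")
print(levels(region2)[1])
```

In regression output, the same rule applies in reverse: the level that does not appear among the listed coefficients is the reference level.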
(more to come here)
How do outliers influence the least squares line in this plot?
To answer this question, think of where the regression line would be with and without the outlier(s). Without the outliers the regression line would be steeper, and lie closer to the larger group of observations. With the outliers the line is pulled up and away from some of the observations in the larger group.
How do outliers influence the least squares line in this plot?
Without the outlier, there is no evident relationship between \(x\) and \(y\).
Data are available on the log of the surface temperature and the log of the light intensity of 47 stars in the star cluster CYG OB1.
Which of the below best describes the outlier?
Which of the following is true?
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.20760    9.29990   0.990    0.332
bioIQ        0.90144    0.09633   9.358  1.2e-09
Residual standard error: 7.729 on 25 degrees of freedom
Multiple R-squared: 0.7779, Adjusted R-squared: 0.769
F-statistic: 87.56 on 1 and 25 DF, p-value: 1.204e-09
Assuming that these 27 twins comprise a representative sample of all twins separated at birth, we would like to test if these data provide convincing evidence that the IQ of the biological twin is a significant predictor of IQ of the foster twin. What are the appropriate hypotheses?
\[ \begin{split} t_\text{test} &= \frac{0.9014 - 0}{0.0963} = 9.36 \\ \text{df} &= 27 - 2 = 25 \\ \text{p-value} &= P\left( |T_{25}| > 9.36 \right) < 0.01 \end{split} \]
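The test statistic and p-value can be reproduced in R from the regression output. This is a sketch: the estimate, standard error, and sample size come from the output and text above, while the variable names are chosen for illustration.

```r
# Slope estimate and standard error from the regression output (rounded as in the text).
b1         <- 0.9014
se_b1      <- 0.0963
null_value <- 0      # H0: the slope parameter is 0
n          <- 27     # number of twin pairs

# Test statistic and degrees of freedom.
t_stat <- (b1 - null_value) / se_b1
df     <- n - 2

# Two-sided p-value from the t distribution with df degrees of freedom.
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

print(round(t_stat, 2))
print(df)
print(p_value)
```

The p-value agrees, up to rounding, with the `Pr(>|t|)` entry for `bioIQ` in the regression output.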
What can you say about the relationship between % college graduate and % Hispanic in a sample of 100 zip code areas in LA?
Do these data provide convincing evidence that there is a statistically significant relationship between % Hispanic and % college graduates in zip code areas in LA?
Yes. The p-value for % Hispanic is low, so the data provide convincing evidence that the slope parameter is different from 0.
How reliable is this p-value if these zip code areas are not randomly selected?
Not very …
Remember that a confidence interval is calculated as point estimate \(\pm\) margin of error (ME), and that the degrees of freedom associated with the slope in a simple linear regression are \(n - 2\). Which of the below is the correct 95% confidence interval for the slope parameter? Note that the model is based on observations from 27 twins.
\[ \begin{split} n &= 27 \\ \text{df} &= 27 - 2 = 25 \end{split} \]
qt(p = 0.975, df = 25, lower.tail = TRUE)
## [1] 2.059539
\[ t_{25}^* = 2.06 \]
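Putting the pieces together, the interval is point estimate \(\pm\ t^* \times SE\). A minimal sketch in R follows, using the slope estimate and standard error for `bioIQ` from the regression output (rounded as in the hypothesis test); the variable names are assumptions of this example.

```r
# Slope estimate and standard error for bioIQ from the regression output.
b1    <- 0.9014
se_b1 <- 0.0963
df    <- 27 - 2

# Critical value for a 95% confidence interval.
t_star <- qt(p = 0.975, df = df)

# Margin of error and interval: point estimate +/- ME.
me <- t_star * se_b1
ci <- c(lower = b1 - me, upper = b1 + me)

print(round(ci, 2))
```

The interval runs from roughly 0.70 to 1.10. It does not contain 0, which is consistent with the small p-value for the slope.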